A Multivariate Gaussian Mixture Model for Automatic Compound Word Extraction
Abstract
An improved statistical model is proposed in this paper for extracting compound words from a text corpus. Traditional terminology extraction methods rely heavily on simple filtering-and-thresholding methods, which are unable to minimize the error counts objectively. Therefore, a method for minimizing the error counts is very desirable. In this paper, an improved statistical model is developed to integrate parts of speech information as well as other frequently used word association metrics to jointly optimize the extraction task. The features are modelled with a multivariate Gaussian mixture to handle the inter-feature correlations properly. With a training (resp. testing) corpus of 20,715 (resp. 2,301) sentences, the weighted precision & recall (WPR) achieves about 84% for bigram compounds and 86% for trigram compounds. The F-measure performances are about 82% for bigrams and 84% for trigrams.

1. Compound Word Extraction Problems

1.1 Motivation

Compound words are very common in technical manuals. Including such technical terms in the system dictionary beforehand normally improves the performance of an NLP system significantly. In a machine translation system, for instance, the translation quality will be greatly improved if such unknown compounds are identified and included before the translation process begins. On the other hand, if a compound is not in the dictionary, it might be translated incorrectly [Chen 88]. For example, the Chinese translation of ‘green house’ is not the composite of the Chinese translations of ‘green’ and ‘house’. Furthermore, if such new compounds are unregistered, the number of parsing ambiguities will also increase due to the large number of possible parts of speech combinations for the individual words. This in turn reduces the accuracy of disambiguation, degrades the processing or translation quality, and increases the processing time. In addition, for some NLP tasks, such as machine translation, a computer-translated manual is usually processed concurrently by several post-editors in practical operations. Maintaining the consistency of the translated terminologies among different post-editors is therefore very important. If all the terminologies can be entered into the dictionary beforehand, the consistency can be maintained automatically, the translation quality can be greatly improved, and a great deal of post-editing time and consistency maintenance cost can be saved. Since compounds are rather productive and new compounds are created every day, it is impossible to exhaustively store all compounds in a dictionary. Furthermore, identifying the compounds by human inspection is too costly and time-consuming for a large input text. Therefore, spotting and updating such terminologies before translation without much human effort is important; an automatic and quantitative tool for extracting compounds from the text is thus strongly required.

1.2 Technical Problems in Previous Works

The extraction problem can be modeled as a two-class classification problem, in which potential compound candidates are classified into either the compound class or the non-compound class. Many English and Chinese extraction issues have been addressed in the literature [Church 90, Calzolari 90, Bourigault 92, Wu 93, Smadja 93, Su 94b, Tung 94, Chang 95, Wang 95, Smadja 96].
Our focus will be on statistical methods for English compound word extraction, since statistical approaches have many advantages for large-scale systems in automatic training, domain adaptation, systematic improvement, and low maintenance cost. Most statistical approaches [Church 90, Smadja 93, Tung 94, Wang 95, Smadja 96] for terminology extraction rely on word association metrics, such as frequency [Wang 95, Smadja 96], mutual information [Church 90], dice metrics [Smadja 93] and entropy [Tung 94], to identify whether a group of words is a potential compound (or highly associated collocate). The mechanisms for applying such features are often based on simple filtering-and-thresholding statistical tests: a compound candidate is filtered out (or classified as non-compound) if its association metric is below a threshold; when multiple features are available, the features are usually applied one by one, independently, with different heuristically determined thresholds. Such approaches can be implemented easily, and encouraging results have been reported in various works. However, there are several technical problems with such filtering approaches.

First of all, most simple word association features, such as frequency and mutual information, can only indicate whether an n-gram (i.e., a group of n words) is highly associated; however, high association does not always imply that the n-gram is a compound, since there are other syntactic (and even semantic) constraints which will also produce highly associated n-grams. For instance, the word pair "is a" has a sufficiently high frequency of occurrence and high mutual information. Nevertheless, it is not a compound word, since such a construct is produced for syntactic reasons. Many long collocates extractable by such filtering methods are also of this category [Smadja 96]. Therefore, many highly associated non-compound n-grams might be mis-recognized as compounds. Although it is known that syntactic information is useful in resolving such problems, few works integrate high-level syntactic or semantic features, such as parts of speech, with known word association metrics in a simple and effective way. A part of speech related metric is therefore proposed in this paper to formulate the syntactic constraints among the constituents of potential compound candidates. Such integration between word association metrics and syntactic constraints in a uniform formulation is important, since syntactic constraints are closely related to the generation of the compounds, and it is desirable to apply simple statistical tests based on such features instead of using complicated syntactic processing.

Second, since the association features are often applied independently for filtering even when multiple features are available, it is impossible to use all the discrimination information jointly to acquire the best system performance. For instance, by filtering out low frequency candidates and then filtering out candidates with low mutual information, we may discard low frequency candidates which actually have high mutual information. If the filtering mechanism is based on both frequency and mutual information, the system performance is expected to be better. In fact, it is well known that performance is usually improved if multiple features are jointly considered, instead of using a single feature or applying multiple features independently.
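To make the weakness of independent thresholding concrete, the following minimal sketch (Python; the function name and threshold values are our own illustrative choices, not taken from the paper) scores bigrams by raw frequency and pointwise mutual information and filters on each metric separately.

```python
import math
from collections import Counter

def filter_bigrams(tokens, min_freq=5, min_pmi=3.0):
    """Naive filtering-and-thresholding extractor: a bigram survives only if
    its frequency AND its pointwise mutual information each exceed a
    heuristically chosen threshold (illustrative only)."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    n_uni, n_bi = sum(unigrams.values()), sum(bigrams.values())
    kept = []
    for (w1, w2), freq in bigrams.items():
        if freq < min_freq:                      # filter 1: raw frequency
            continue
        pmi = math.log2((freq / n_bi) /
                        ((unigrams[w1] / n_uni) * (unigrams[w2] / n_uni)))
        if pmi >= min_pmi:                       # filter 2: mutual information
            kept.append(((w1, w2), freq, pmi))
    return kept
```

Because the two thresholds are applied one after the other, the frequency test silently discards rare but strongly associated candidates before the mutual information test is ever consulted, which is exactly the behaviour the joint model proposed below is meant to avoid.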
Therefore, what is really needed is an automatic approach that combines all available features to acquire the best performance in the extraction task. However, several factors must be considered carefully in order to exploit the discrimination information provided by multiple association features. For instance, many features proposed in the literature are highly correlated. Therefore, the correlations among the association features must be included in the statistical model in order to acquire the best achievable performance. In this work, we therefore use (a mixture of) multivariate Gaussian density functions to incorporate the effects of the inter-feature correlation. Furthermore, when combining the features it is desirable to use only the most discriminative features and to reject features that are either non-discriminative or redundant with respect to other, more discriminative features. In this paper we therefore propose an integrated method, which selects the most appropriate features automatically, for combining a set of useful features. In particular, optimization based on frequency, mutual information, dice metric, contextual entropy and parts of speech information will be surveyed.

To sum up, current terminology extraction research does not fully exploit techniques for (1) integrating high-level syntactic information in a simple and effective way, and (2) combining useful features jointly for discrimination. To attack such problems, the parts of speech information, which encodes syntactic constraints, is integrated with several known word association metrics in one unified scoring mechanism. The correlations among the features are taken into consideration in designing the classifier. A feature selection mechanism is used to incorporate as many discriminative and non-redundant features as possible, so that the terminology extraction task is based on the joint observations of the most discriminative features. A minimum error classifier, based on a likelihood ratio test, is used as the basis for minimizing the classification error in the extraction task. In the following sections, we will therefore focus on the general issues in designing a good minimum error classifier, which jointly considers a set of association features to achieve minimum classification error. The simulation results show that the proposed approach gives promising results. The tool has also been observed to be useful in cooperation with a machine translation system [Chen 91].

2. Optimal Classifier Design

2.1 Optimization Criteria in Compound Extraction

In a compound retrieval task, it is desirable to recover from the corpus as many real candidates as possible; in addition, the extracted compound word list should contain as few ‘false alarms’ (i.e., incorrect candidates) as possible. The ability to extract real candidates from the corpus is defined in terms of the recall rate, which is the percentage of real compounds that are extracted into the compound list by the classifier; on the other hand, the ability to exclude false alarms from the extracted compound list is defined in terms of the precision rate, which is the percentage of real compounds in the extracted compound list. Let n_{αβ} be the number of class-α input tokens which are classified as class-β (α, β = 1 for compound and 2 for non-compound, respectively), and let n_1 represent the number of real compounds in the corpus.
The precision p and recall r are defined as follows:

p = n_{11} / (n_{11} + n_{21}),    r = n_{11} / (n_{11} + n_{12}) = n_{11} / n_1

The precision and recall rates are, in many cases, two contradictory performance indices, especially for simple filtering approaches: when one of the indices is raised, the other might degrade. To make a fair comparison of performance, a joint performance criterion function O(p, r) of the precision (p) and recall (r) rates is usually used to evaluate the system, instead of evaluating precision or recall alone. In the following sections, the weighted precision & recall (WPR) and the F-measure (FM) will be adopted as the optimization criteria. The weighted precision and recall (WPR), which reflects the average of these two indices, is proposed here as the weighted sum of the precision and recall rates:

WPR(w_p : w_r) = w_p * p + w_r * r,    with w_p + w_r = 1

where w_p and w_r are the weighting factors for precision and recall, respectively. The F-measure (FM) [Appelt 93, Hirschman 95, Hobbs 96], defined as follows, is another joint performance metric which allows lexicographers to weight precision and recall differently:

FM(β) = ((β^2 + 1) * p * r) / (β^2 * p + r)

where β encodes the user's preference for precision or recall. When β is close to 0 (i.e., FM is close to p), the lexicographer prefers a system with higher precision; when β is large, the lexicographer prefers a system with higher recall. We will use w_p = w_r = 0.5 and β = 1 throughout this work, which means that no particular preference for precision or recall is imposed. With β = 1, FM reduces to 2pr / (p + r), which rewards balance between precision and recall in the sense that equal precision and recall is most preferred when p + r is fixed. With the optimization criteria defined, our goal is to design an optimal classifier which maximizes the WPR and FM.

2.2 Task Definition for Optimal Classifier Design

Conventional extraction methods tend to use a list of word association related constraints for filtering out candidates of low likelihood, based on certain word association metrics and empirical thresholds for those metrics. Unfortunately, there are no simple rules, other than trial and error, for such methods to acquire the optimal thresholds for the required precision or recall performance. In general, when the precision is raised by using high thresholds, the recall degrades, and vice versa. Lexicographers can only use such tools by guessing; it is very difficult to automatically fit the lexicographers’ preference on the precision-vs-recall trade-off. Such difficulty can be resolved if we can design an optimal classifier that automatically maximizes a performance criterion, such as WPR or FM, which encodes the user preference in pre-specified weights. The extraction problem can be regarded as a two-class classification problem in which each n-gram candidate is assigned either the compound label or the non-compound label based on the feature vector x associated with the candidate. Designing a compound extractor is therefore equivalent to designing a discrimination function g(x; Λ) (which is capable of scoring how likely a candidate is to come from the compound class) and using a set of decision rules to decide which n-gram candidates are compounds. (The symbol Λ refers to the parameters of the discrimination function, such as the distributional means or variances of the probability density functions used in a statistical model.)
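For reference, the evaluation measures above translate directly into code; the helper names in this short sketch are ours, not the paper's.

```python
def precision_recall(n11, n21, n12):
    """p = n11 / (n11 + n21), r = n11 / (n11 + n12), where n_ab is the number
    of class-a candidates classified as class-b (1 = compound, 2 = non-compound)."""
    return n11 / (n11 + n21), n11 / (n11 + n12)

def wpr(p, r, w_p=0.5, w_r=0.5):
    """Weighted precision and recall, with w_p + w_r = 1."""
    return w_p * p + w_r * r

def f_measure(p, r, beta=1.0):
    """F-measure; with beta = 1 this reduces to 2pr / (p + r)."""
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```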
Different discrimination functions and decision rules will classify the input candidates differently, and thus have different performance in terms of a given performance criterion. Designing an optimal classifier for a particular criterion function is therefore equivalent to finding a partition of the feature space into decision regions for the compound class and the non-compound class; feature vectors belonging to the compound decision regions are classified as compounds, otherwise they are classified as non-compounds. Our main task is therefore to design an optimal classifier (or equivalently the corresponding discrimination function g*_{O(p,r)}(x; Λ)) which maximizes an objective criterion function O(p, r) of the precision (p) and recall (r) rates.

2.3 Optimal Classifier for Precision and Recall Optimization

Given the underlying distributions, f(x|C) and f(x|C̄), of the feature vectors x in the compound class (C) and the non-compound class (C̄), it is possible to estimate the error probabilities associated with any decision region (or equivalently, any threshold, decision rule or statistical test which could be used to define such a region) for a class. Therefore, it is possible to design the optimal classifier for some simple criterion functions if the feature distribution is very simple. In fact, procedures for designing optimal classifiers, such as the minimum error classifier, have been well studied in the speech, communication and pattern recognition communities [Devijver 82, Juang 92]. For example, the decision rule that minimizes the expected probability of classification error turns out to be a likelihood ratio test in the two-class classification case [Devijver 82]. However, since WPR and FM are non-linear functions of the classification errors (i.e., non-linear functions of n_{12} and n_{21}), it is hard to find a simple analytical discrimination function g*_{O(p,r)}(x; Λ) for testing whether an n-gram is a compound such that the joint performance O(p, r) is maximized. Therefore, a two-stage optimization scheme is proposed here in order to optimize a user-specified criterion function of precision and recall while retaining a small error rate. In the first stage, a minimum error classifier, g*_e(x; Λ) (which satisfies the minimum error criterion), is used as the base classifier to minimize the error rate (e) of classification. In the second stage, a learning method is applied, starting from the minimum error status, to optimize a user-specified criterion function of the recall and precision rates by adjusting the parameters of the classifier according to the misclassified instances. Figure 1 shows the block diagram for training such a classifier. In the training flow, the n-grams in the training text corpus are extracted and manually inspected; the real compounds within the text corpus are used to construct a compound dictionary. The feature vectors associated with the n-grams are divided into the compound and non-compound classes according to the compound dictionary. The parameters for the compound class (Λ_C) and the non-compound class (Λ_C̄) are estimated from the distributions of the two classes. The training n-grams are then classified by the minimum error classifier, and the result is compared with the compound dictionary afterward. The misclassified n-grams are then used to adjust the parameters iteratively so that the criterion function is maximized.
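A minimal sketch of the first-stage minimum error classifier under the modelling assumptions stated above: each class is given a full-covariance Gaussian mixture density, and an n-gram is labelled a compound when the log likelihood ratio plus the log prior ratio is positive. Using scikit-learn here is our own convenience, not something prescribed by the paper.

```python
import numpy as np
from sklearn.mixture import GaussianMixture  # assumption: scikit-learn is available

def fit_class_models(X_compound, X_noncompound, n_components=2):
    """Estimate a full-covariance Gaussian mixture per class; full covariances
    capture the inter-feature correlations among the association metrics."""
    gmm_c = GaussianMixture(n_components=n_components, covariance_type="full").fit(X_compound)
    gmm_n = GaussianMixture(n_components=n_components, covariance_type="full").fit(X_noncompound)
    log_prior_ratio = np.log(len(X_compound) / len(X_noncompound))
    return gmm_c, gmm_n, log_prior_ratio

def min_error_classify(X, gmm_c, gmm_n, log_prior_ratio):
    """Likelihood ratio test: label x as compound when
    log f(x|C) - log f(x|C_bar) + log P(C)/P(C_bar) > 0."""
    llr = gmm_c.score_samples(X) - gmm_n.score_samples(X)
    return llr + log_prior_ratio > 0
```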
The first optimization stage serves to determine the appropriate thresholds (or, more precisely, the decision boundaries) in the feature space so that as little misclassification as possible is attained. In this way, the precision and recall are expected to be improved indirectly. The second stage, on the other hand, adjusts the parameters of the classifier to achieve a local optimum of the joint precision-recall performance, starting from the minimum error status instead of optimizing the precision and recall from an arbitrary decision boundary. In other words, we are not trying to find some simple analytical discrimination function which is capable of identifying the optimal decision boundaries for precision-recall optimization. Instead, we first establish reasonably optimized decision boundaries by using the simple discrimination function of the minimum error classifier, and then modify the decision boundaries by changing the parameters of its distribution functions, Λ = {Λ_C, Λ_C̄}, to maximize the joint precision-recall performance.

[Figure 1. Block diagram for training the classifier from the n-grams of the text corpus.]
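The paper's second stage adjusts the distribution parameters Λ from the misclassified instances; the exact update rule is not reproduced here. As a deliberately simplified stand-in, the sketch below starts from the minimum-error decision threshold of zero on the log likelihood ratio and shifts only that threshold to maximize whichever criterion (WPR or FM) the user supplies.

```python
import numpy as np

def tune_decision_threshold(llr, is_compound, criterion, shifts=np.linspace(-5, 5, 201)):
    """Illustrative second-stage search (not the paper's learning rule): pick the
    threshold shift on the log likelihood ratio that maximizes the chosen
    criterion, e.g. criterion = lambda p, r: 0.5 * p + 0.5 * r for WPR."""
    best_shift, best_score = 0.0, -np.inf
    for t in shifts:
        predicted = llr > t
        n11 = int(np.sum(predicted & is_compound))    # true compounds kept
        n21 = int(np.sum(predicted & ~is_compound))   # false alarms
        n12 = int(np.sum(~predicted & is_compound))   # missed compounds
        if n11 == 0:
            continue
        p, r = n11 / (n11 + n21), n11 / (n11 + n12)
        score = criterion(p, r)
        if score > best_score:
            best_shift, best_score = t, score
    return best_shift, best_score
```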